Do statistical heterogeneity methods impact the results of meta-analyses? A meta-epidemiological study

Background Orthodontic systematic reviews (SRs) use different methods to pool individual studies in a meta-analysis when indicated. However, the number of studies included in orthodontic meta-analyses is relatively small. This study aimed to evaluate the direction of estimate changes of orthodontic meta-analyses (MAs) using different between-study variance methods, considering the level of heterogeneity, when few trials were pooled.
Methods Search and study selection: SRs published over three years, from the 1st of January 2020 to the 31st of December 2022, in six main orthodontic journals with at least one MA pooling five or fewer primary studies were identified. Data collection and analysis: Data were extracted from each eligible MA, which was replicated in a random effects model using the DerSimonian and Laird (DL), Paule–Mandel (PM), restricted maximum-likelihood (REML), and Hartung-Knapp and Sidik-Jonkman (HKSJ) methods. The results were reported using the median and interquartile range (IQR) for continuous data and frequencies for categorical data, and were analyzed using non-parametric tests. The Boruta algorithm was used to identify significant predictors of a significant change in the confidence interval of each method relative to the DL method; this analysis was feasible only for the HKSJ method.
Results A total of 146 MAs were included, most applying the random effects model (n = 111; 76%) and pooling continuous data using the mean difference (n = 121; 83%). The median number of studies was three (range 2, 4), and the overall statistical heterogeneity (I²) ranged from 0 to 99% with a median of 68%. Close to 60% of the significant findings became non-significant when HKSJ was applied instead of the DL method and heterogeneity was present (I² > 0%). On the other hand, 30.43% of the non-significant meta-analyses using the DL method became significant when HKSJ was used and heterogeneity was absent (I² = 0%).
Conclusion Orthodontic MAs with few studies can produce different results depending on the between-study variance method and the level of statistical heterogeneity. Compared to DL, the HKSJ method is overconservative when I² is greater than 0% and may produce false positive findings when heterogeneity is absent.


Introduction
The publication of systematic reviews (SRs) in orthodontics has increased exponentially in recent years [1,2]. A properly conducted SR requires many rigorous steps, starting with a comprehensive search for the available evidence, selecting the eligible studies, extracting data, and finally synthesizing the data [3,4]. To guarantee valid results, SR authors should follow a proven, rigorous methodology that maximizes the proper consideration of the available evidence. Researchers generally use a meta-analysis (MA) to synthesize data from individual studies and provide the reader with summarized quantitative results. An MA can facilitate interpretation of the results by providing a cumulative effect measure. Selecting an appropriate statistical model is paramount for the validity of the MA.
Two main statistical models are applied in MAs: the fixed effect model and the random effects model [5]. The fixed effect model assumes that all studies included in the MA share the same true effect size, while the random effects model allows the effect size to vary between the pooled studies. In other words, the choice between the two models depends on the expected or measured variation in effect size from study to study, which may result from differences (heterogeneity) in the participants' characteristics (age, gender, ethnicity, type of malocclusion, crowding level, etc.) and in the implementation of the intervention (type, dose, activation of appliance, wearing time of appliance, follow-up, etc.). The fixed effect model ignores between-study heterogeneity or assumes it is nonexistent. In contrast, the random effects model quantifies the degree of statistical heterogeneity in the MA and gives relatively larger weight to smaller studies than the fixed effect model [6,7]. The random effects model provides results identical to those of the fixed effect model if there is no heterogeneity. The random effects model can use different estimators of the between-study variance [3] to calculate confidence intervals in an MA.
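The weighting difference between the two models can be illustrated with a minimal Python sketch. It assumes standard inverse-variance weighting, 1/v for the fixed effect model and 1/(v + τ²) for the random effects model. The function name `pooled_estimate`, the value τ² = 0.1, and the three example studies are hypothetical and are not taken from the included MAs.

```python
def pooled_estimate(effects, variances, tau2=0.0):
    """Inverse-variance pooled effect and relative study weights.

    tau2 = 0 corresponds to the fixed effect model; tau2 > 0 adds the
    between-study variance to every study's within-study variance.
    """
    weights = [1.0 / (v + tau2) for v in variances]
    total = sum(weights)
    estimate = sum(w * y for w, y in zip(weights, effects)) / total
    relative_weights = [w / total for w in weights]
    return estimate, relative_weights


# Hypothetical data: the third study has the largest variance (smallest study).
effects = [0.2, 0.5, 0.9]
variances = [0.01, 0.04, 0.25]

fixed_estimate, fixed_weights = pooled_estimate(effects, variances)
random_estimate, random_weights = pooled_estimate(effects, variances, tau2=0.1)

for name, rel in (("fixed", fixed_weights), ("random", random_weights)):
    print(name, [round(w, 3) for w in rel])
```

In this sketch the relative weight of the smallest study rises once τ² is added, while the weight of the largest study falls, which is the behavior described above. With `tau2=0.0` the two models coincide.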
The simplest and most commonly applied between-study variance estimator is the DerSimonian and Laird (DL) approach [8,9]. The DL method is the default option in many popular statistical software packages, such as Comprehensive Meta-analysis (CMA) [10] and Review Manager (RevMan) [11]. However, DL may lead to false positive results, particularly when the number of studies is small [12], the studies are unequal in size [13], and the heterogeneity is high [14]. Several alternative methods exist to overcome the limitations of the DL method: the Paule-Mandel (PM) method [15], the restricted maximum-likelihood (REML) method, and the Hartung and Knapp [16] and Sidik and Jonkman [17] (HKSJ) method. Many simulation studies [13,18-20] have investigated the effect of different between-study variance estimators on the pooled estimate, but their findings conflict. A previous study recommended the REML and HKSJ methods over the DL method [19], while PM was recommended in three simulation studies reported in a systematic review [21].
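To make the contrast between DL and HKSJ concrete, the following Python sketch computes the DL moment estimate of τ² and then two 95% confidence intervals for the pooled effect: the usual normal-based (Wald) interval and an interval using the Hartung-Knapp/Sidik-Jonkman variance with a t quantile on k - 1 degrees of freedom. It is a sketch under stated assumptions, not the software implementation used in any particular package: pairing the HKSJ adjustment with the DL estimate of τ² is an assumption, the function names are hypothetical, the t critical values are rounded table values for 2 to 5 studies, and the example data are invented.

```python
import math
from statistics import NormalDist

# Two-sided 95% t critical values for df = 1..4 (k = 2..5 studies), rounded.
T_CRIT_975 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776}


def dl_tau2(effects, variances):
    """DerSimonian-Laird moment estimate of the between-study variance."""
    k = len(effects)
    w = [1.0 / v for v in variances]
    sw = sum(w)
    mu_fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    return max(0.0, (q - (k - 1)) / c)  # truncated at zero


def random_effects_cis(effects, variances):
    """Pooled effect with a Wald (DL) CI and an HKSJ-adjusted CI."""
    k = len(effects)
    tau2 = dl_tau2(effects, variances)
    w = [1.0 / (v + tau2) for v in variances]
    sw = sum(w)
    mu = sum(wi * yi for wi, yi in zip(w, effects)) / sw

    # Wald interval: normal quantile, variance 1 / sum of weights.
    z = NormalDist().inv_cdf(0.975)
    se_wald = math.sqrt(1.0 / sw)

    # HKSJ interval: weighted residual variance and a t quantile.
    resid = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, effects))
    se_hksj = math.sqrt(resid / ((k - 1) * sw))
    t = T_CRIT_975[k - 1]

    return {
        "mu": mu,
        "tau2": tau2,
        "dl": (mu - z * se_wald, mu + z * se_wald),
        "hksj": (mu - t * se_hksj, mu + t * se_hksj),
    }


# Hypothetical heterogeneous MA with three studies.
result = random_effects_cis([0.1, 0.6, 1.2], [0.02, 0.03, 0.05])
print(result)
```

For these invented data τ² is positive and the HKSJ interval is wider than the DL interval, mainly because the t quantile with 2 degrees of freedom (about 4.3) is much larger than the normal quantile (about 1.96).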
A previous study [22] revealed that approximately 65% of MAs in orthodontics include fewer than five trials. The potential impact of the choice of between-study variance estimator increases with smaller samples [23]. To address the shortcomings of pooling few trials in orthodontic MAs, a recent study investigated the use of the heterogeneity estimator corrected by Hartung-Knapp on MAs in orthodontics [24]. Still, the authors included a wide range of pooled studies per MA (from 3 to 45). This large variation in the number of primary studies can dilute the true impact on smaller samples. Keeping this in mind, the present study aimed to investigate the impact of different between-study variance estimators, the degree of heterogeneity, and the sample size equality of pooled studies on the results of MAs in orthodontics with few trials (fewer than 5).

Materials and methods
The reporting of this methodological study followed the reporting guidelines for methodological studies [25], which were adapted from the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA).

Eligibility criteria
SRs were included if they met the following criteria:
• Orthodontic SRs published between the 1st of January 2020 and the 31st of December 2022 in six journals: Progress in Orthodontics (PIO), European Journal of Orthodontics (EJO), Orthodontics & Craniofacial Research (OCR), American Journal of Orthodontics and Dentofacial Orthopedics (AJODO), The Angle Orthodontist (AO), and Korean Journal of Orthodontics (KJO).
• SRs that included at least one meta-analysis with fewer than five primary studies reporting interventional procedures.
SRs of animal or laboratory trials, network meta-analyses, and MAs of incidence or prevalence were excluded. Scoping reviews, overviews, guidelines, and qualitative SRs were also excluded.

Search and study selection
One author (xx) performed an electronic search in Medline via PubMed using MeSH terms and text words, and searched the journal websites, to collect the relevant records (S1 Table). Two authors (xx, xxx) screened the titles and abstracts of the retrieved studies independently and in duplicate. Systematic reviews with meta-analyses related to orthodontics were initially included without regard to the number of primary studies. Once the full text was obtained, the number of included primary studies per meta-analysis was evaluated. Disagreements were resolved by discussion until a consensus was reached.

Data collection process
The authors extracted the following characteristics of each meta-analysis: the journal title, the effect measure (mean difference, risk ratio, odds ratio, standardized mean difference), the MA model implemented (random, fixed), the number of included primary studies, the number of participants, the between-study variance estimator utilized (DL, REML, PM, HK, SJ), and the statistical software used.
For each eligible meta-analysis with data presented in arm-level format, the number of participants in each study group, the mean and standard deviation for continuous data, and the number of events and non-events for dichotomous outcomes were extracted from the forest plots and entered in a Microsoft Excel (Microsoft, Redmond, Washington, USA) file. The mean and the standard error or the upper and lower confidence limits were extracted if data were presented in contrast-level format.

Statistical analysis and data synthesis
The characteristics of the included meta-analyses were summarized using medians and interquartile ranges for continuous variables and frequencies for categorical variables. Each meta-analysis was replicated by one author (xx) using four different heterogeneity estimators (DL, REML, PM, HKSJ) available in Stata 15.1 software (Stata Corp, Texas, USA). The results of each replicated MA were compared to the original MA results under the same scenario to rule out errors in data extraction. Then, from each MA output, we collected the overall effect size, the corresponding 95% confidence interval (CI), the P-value, and the level of statistical heterogeneity expressed by I², Q, and τ² (tau²). Furthermore, the ratio between the confidence interval of each estimator and the confidence interval of DL was calculated as the ratio of their widths:
CI ratio = (upper limit - lower limit of the estimator's 95% CI) / (upper limit - lower limit of the DL 95% CI)
For the equality of trials' size, the percentage difference between the largest and the smallest study in each MA was calculated. Studies were considered equal to some extent if the difference was up to 20%, moderately unequal if between 21-50%, unequal if between 51-100%, and substantially unequal if >100%.
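The two derived quantities described above, the confidence interval ratio and the trial-size equality category, can be sketched in Python as follows. Several details are assumptions rather than the authors' stated procedure: the ratio is taken as a ratio of interval widths, the size difference is expressed relative to the smallest study (so that values above 100% are possible), and non-integer percentages falling between the published bands are assigned to the lower band. The function names and example values are hypothetical.

```python
def ci_width_ratio(ci_estimator, ci_dl):
    """Width of an estimator's CI divided by the width of the DL CI."""
    lower_e, upper_e = ci_estimator
    lower_d, upper_d = ci_dl
    return (upper_e - lower_e) / (upper_d - lower_d)


def size_equality_category(sample_sizes):
    """Classify an MA by the percentage difference between its largest
    and smallest study (assumed to be relative to the smallest study)."""
    smallest, largest = min(sample_sizes), max(sample_sizes)
    pct = 100.0 * (largest - smallest) / smallest
    if pct <= 20:
        return "equal to some extent"
    if pct <= 50:
        return "moderately unequal"
    if pct <= 100:
        return "unequal"
    return "substantially unequal"


# Hypothetical example: an HKSJ interval twice as wide as the DL interval.
print(ci_width_ratio((-1.0, 3.0), (0.0, 2.0)))
print(size_equality_category([30, 33]))
print(size_equality_category([20, 50]))
```

A ratio above one indicates an interval wider than the DL interval, and a ratio below one indicates a narrower interval.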
The significance of the meta-analyses under the different heterogeneity estimators, the P values, and the confidence interval ratios for the different estimators were calculated and stratified according to the level of statistical heterogeneity expressed by the I² statistic. The results were analyzed and visualized using the R statistical package (version 4.3.0) (R Foundation for Statistical Computing, Vienna, Austria). Finally, the feature selection algorithm Boruta was used to identify significant predictors associated with a significant change in the confidence interval relative to the DL method; this was applicable only to the HKSJ method.

Meta-analyses characteristics
The selection process of the studies in the CONSORT format is illustrated in Fig 1. The excluded studies and the reasons for exclusion are given in S2 Table. A total of 146 meta-analyses with few studies were analyzed. The vast majority of the included MAs (n = 111; 76.03%) were conducted using the random effects model, and the output was expressed using the mean difference (n = 121; 82.88%). Likewise, most of the included MAs did not report the heterogeneity estimator used in their analysis (n = 99; 83.19%). Nevertheless, the DL method was reported in a few MAs (n = 14; 11.76%), and the other methods, such as HK, SJ, and REML, were barely reported. The median number of studies was 3 (IQR: 2 to 4), with a median of 139 participants (IQR: 99 to 249) per MA. The size of the studies per MA was equal in only 18% (21/146). The most frequently used software was RevMan (n = 112; 76.71%), followed by Stata and CMA software.
The heterogeneity measures varied: I² ranged from 0 to 99% with a median of 68% (IQR: 0 to 78), Q ranged from 0 to 224.5 with a median of 2 (IQR: 0 to 8), and τ² (tau²) ranged from 0 to 75000 with a median of 0.0007 (IQR: 0 to 1) (Table 1).

Replicating meta-analyses using the four heterogeneity estimator methods
The 146 MAs were replicated using the four heterogeneity estimators: DL, REML, PM, and HKSJ (Table 2). The overall effect estimate did not change across the four estimators. However, the change was obvious in the width of the confidence interval (CI) and, subsequently, in the P-value and significance of the findings. REML and PM showed only slightly different confidence intervals and significance levels. In contrast, HKSJ produced wider CIs than DL: when I² was greater than 0, the HKSJ interval was mostly double the width of the DL interval and up to 6 times as wide. HKSJ also generally yielded wider confidence intervals when I² was 0, but the increase was smaller than in MAs with higher I² levels. Interestingly, the confidence interval using the HKSJ method was wider in all meta-analyses when I² was greater than 0, but it was narrower in (30/71; 42.25%) meta-analyses with a lack of heterogeneity (I² = 0) (Table 2 and Fig 2).
With regard to P-values and the significance of the results, more than half (21/35; 60%) of the significant findings of MAs with statistical heterogeneity (I² > 0) using the DL method became non-significant using HKSJ. On the other hand, in meta-analyses with a lack of heterogeneity (I² = 0), (7/47; 30.43%) of non-significant results became significant and (8/24; 16%) of significant results became non-significant when HKSJ was used instead of the DL method. As for the other methods, one significant result changed using the PM method (Table 3 and Fig 3).
The Boruta algorithm is a feature selection algorithm that measures the importance of each variable with respect to the outcome. The Boruta algorithm identified the heterogeneity measures (Q, I², and τ²) and the number of studies as significant predictors of the change in HKSJ confidence interval width compared to DL. In other words, the change in the HKSJ/DL CI ratio was affected by the number of studies included in the MA and the values of the statistical heterogeneity measures (Fig 4).

Discussion
The DL method in random effects meta-analysis is commonly used for calculating the confidence interval in orthodontic MAs. Although the between-study variance method was not reported in 83% (99/119) of the random effects meta-analyses, 98% (97/99) of those MAs were conducted using RevMan or CMA, where the default between-study variance method is DL. Previous simulation studies [26-28] reported that the DL method has a high rate of false positive results. Likewise, two methodological studies [13,29] found a considerable number of false positive results in Cochrane MAs with significant findings using the DL method. The current investigation replicated 146 MAs with few primary studies (fewer than 5) using four between-study variance methods (DL, REML, PM, HKSJ) and found that (30/69; 43.47%) of the significant MAs became non-significant after applying HKSJ. This finding confirms previous empirical assessments [13,29], which investigated Cochrane MAs and revealed that methods based on the normal distribution (such as DL) produced more significant results than methods based on the t-distribution (such as HKSJ). Likewise, a simulation study found that the t-distribution produces a wider confidence interval than the normal distribution, especially when the statistical heterogeneity and the number of studies are small in a meta-analysis [30]. This may raise concerns related to overcoverage and loss of power [31]. This may explain our findings: when heterogeneity was present (I² > 0%), the confidence interval of MAs using HKSJ was double to approximately 7 times the length of the confidence interval obtained with DL. Hence, HKSJ is overconservative and may yield non-significant results for effects that are actually significant. A strength of this study is that it compares the results of the different methods according to the heterogeneity level.
It is also important to mention that 30.43% (7/40) of non-significant MAs using the DL method became significant with HKSJ when statistical heterogeneity was absent (I² = 0%). This is supported by the finding that the confidence interval was narrower in 42.25% (30/71) of MAs using HKSJ than using DL when statistical heterogeneity was absent (I² = 0%) (Fig 5). These findings are consistent with a previous study [32], which replicated the analysis of 157 MAs with binary data using HK and DL and concluded that HK may yield a narrower confidence interval and smaller P-values than DL in some homogeneous meta-analyses. Consequently, HKSJ is not always conservative [31] and may produce false positive results when heterogeneity is absent. A previous study [31] suggested avoiding the Hartung-Knapp method when heterogeneity is absent (τ² = 0) because it produces modified results with shorter confidence intervals and smaller P values.
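Why HKSJ can be narrower than DL in homogeneous MAs can be seen from a short worked sketch. Assuming the standard forms of both intervals, when τ² is estimated as 0 the DL standard error is sqrt(1/Σw) and the HKSJ standard error is sqrt(Q/((k - 1)Σw)), where w are the inverse-variance weights and Q is Cochran's statistic. The ratio of the HKSJ width to the DL width is then (t/z)·sqrt(Q/(k - 1)), which falls below one when the study estimates agree very closely. The function name, the rounded t critical values, and the example data below are hypothetical.

```python
from statistics import NormalDist

# Two-sided 95% t critical values for df = 1..4 (k = 2..5 studies), rounded.
T_CRIT_975 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776}


def hksj_dl_width_ratio_homogeneous(effects, variances):
    """HKSJ/DL 95% CI width ratio when the DL estimate of tau^2 is zero."""
    k = len(effects)
    w = [1.0 / v for v in variances]
    sw = sum(w)
    mu = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, effects))
    if q > k - 1:
        # DL would give tau^2 > 0, so the simplification does not apply.
        raise ValueError("Q exceeds k - 1; tau^2 is not zero under DL")
    z = NormalDist().inv_cdf(0.975)
    return (T_CRIT_975[k - 1] / z) * (q / (k - 1)) ** 0.5


# Hypothetical MA of three studies with almost identical estimates.
ratio = hksj_dl_width_ratio_homogeneous([0.30, 0.31, 0.29], [0.02, 0.03, 0.05])
print(round(ratio, 3))
```

For these invented data the ratio is well below one, so a result that is non-significant under DL could become significant under HKSJ without any change in the pooled estimate.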
The Boruta algorithm found that the number of studies and the heterogeneity measures are significant predictors of the change in HKSJ confidence interval width. However, the model rejected the effect of equality of trials' size on the confidence interval change between the HKSJ and DL methods. In contrast, a previous study [13] found that equal-sized trials may decrease the change in type I error rate even if there is moderate heterogeneity. In our study, the high number of unequal-size trials, the large difference in trials' size within individual MAs, or the very few trials per MA might render the effect of size equality trivial.
Simulation studies [33] showed that the REML and PM methods are more robust than the DL method. They recommended the PM method for both continuous and binary outcomes and the REML method for continuous outcomes [34], even for large heterogeneity (τ²). A recent study found similar results for DL, REML, and PM, especially when the heterogeneity was absent. This is consistent with the findings of Chung et al. [35], who found that DL and REML produce similar findings when the number of studies is small.
A previous investigation [24] included a wide range of pooled studies per meta-analysis and assessed the results using eight different heterogeneity methods with and without HK adjustment. The authors concluded that "meta-analysis with at least three studies is sensitive to HK correction". However, we found that in meta-analyses with few studies (fewer than 5), HKSJ is overconservative when I² is greater than 0%, while HKSJ may produce false positive results when heterogeneity is absent (I² = 0%). Although making a clear recommendation about the best method is difficult when few studies are available, we suggest using the DL method for pooling a small number of primary studies (fewer than 5) in MAs when I² = 0%, and a sensitivity analysis using both HKSJ and DL when I² > 0%.

Clinical implications
The clinical implications of this study are important for MAs focused on orthodontic samples.As many MAs related to orthodontics include small samples of primary studies, selecting the between-study variance methods when conducting the MAs can determine whether the identified differences are significant.It is recommended that MAs related to orthodontics that include small samples of primary studies should report using more than one between-study variance method (sensitivity analysis), especially when the confidence interval is close to the significance/non-significance cut-off point.

Limitations
The search period was restricted to three years, but the study aimed to collect MAs rather than only SRs and to map a problem rather than to provide a more robust estimate. Although relatively short, this period probably reflects current practice. Only MAs with two to four studies were included in this study in order to concentrate on MAs with fewer primary studies. Finally, our assessment did not investigate the effect of between-study variance methods on the prediction intervals because more than one-third of the sample (53/146; 36.3%) had only two studies.

Conclusion
This sample of MAs with fewer than five studies appears to be sensitive to the selected between-study variance method. HKSJ doubled the confidence interval of the pooled estimate when heterogeneity was present. However, HKSJ narrowed the confidence intervals of 30% of MAs when heterogeneity was absent, leading to more significant differences compared to the DL method.

Fig 2. Boxplot depicting the ratios of confidence intervals for the three heterogeneity estimators: HKSJ/DL, PM/DL, and REML/DL. The median of REML/DL and PM/DL is almost equal to one. The median of the HKSJ/DL ratio is slightly greater than two when I² > 0%, and it is slightly greater than one when I² = 0%. It is worth noting that almost one-third of the boxplot represents less than one.
https://doi.org/10.1371/journal.pone.0298526.g002